What do we mean by data wrangling?
There are many possible definitions, but it generally means all the transformations you might need to do to get data into a form that is ready for further analysis.
Create a new column based on calculations in your data
Select only certain parts of your data
Do calculations on subsets of your data
Modify text data
Iterate over your data
Merge data from multiple sources
Create a new column based on calculations in your data: mutate()
Select only certain columns or rows of your data: filter(), select()
Do calculations on subsets of your data: group_by(), summarise()
Modify text data: stringr functions
Iterate over your data: for loops, apply(), purrr::map()
Merge data from multiple sources: join()
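A quick taste of most of these verbs chained together, using the built-in mtcars dataset (a sketch; the workshop itself uses UFO sighting data below, and the `kpl` column is made up for illustration):

```r
library(dplyr)

mtcars %>%
  mutate(kpl = mpg * 0.4251) %>%        # create a new column from a calculation
  filter(cyl != 6) %>%                  # select only certain rows
  select(mpg, kpl, cyl) %>%             # select only certain columns
  group_by(cyl) %>%                     # do calculations on subsets
  summarise(mean_kpl = mean(kpl))
```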
I just downloaded it for the first time.
I use it, but it makes me uncomfortable.
I’m comfortable with it, here to learn about the tidyverse.
No idea what you’re talking about.
I’ve heard of and used these things, but they’re a stretch.
Easy peasy.
A collection of R packages designed for data wrangling and visualization, built for tidying your data and working with tidy data.
library(tidyverse)
Elegant, consistent way of organizing data
First argument to each function is typically a data frame (consistency)
More like reading English than base R code
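To see the "reads like English" point, compare the same computation written inside-out in base R and left-to-right with the pipe (a small sketch on mtcars):

```r
library(dplyr)

# Base R: read from the innermost call outward
head(round(colMeans(mtcars), 1))

# Tidyverse: take mtcars, THEN compute column means, THEN round, THEN head
mtcars %>% colMeans() %>% round(1) %>% head()
```

Both lines produce identical output; the pipe just matches the order you would describe the steps out loud.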
Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table. (https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)
An untidy example:
| names | favorite_food | favorite_color |
|---|---|---|
| Adrienne | burritos | yellow |
| Alex | pizza | turquoise |
Tidied:
| names | variable | value |
|---|---|---|
| Adrienne | favorite_food | burritos |
| Alex | favorite_food | pizza |
| Adrienne | favorite_color | yellow |
| Alex | favorite_color | turquoise |
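The tidying above can be done with tidyr's gather() (a sketch; the column names match the example table):

```r
library(dplyr)
library(tidyr)

untidy <- data.frame(
  names          = c("Adrienne", "Alex"),
  favorite_food  = c("burritos", "pizza"),
  favorite_color = c("yellow", "turquoise")
)

# Collapse every column except `names` into variable/value pairs
tidy <- untidy %>%
  gather(key = "variable", value = "value", -names)
tidy
```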
Sometimes messy can be good!
Data entry
Data with matrix operations (e.g., statistical modeling)
Nice to know your options
Read data using read_csv().
ufo <- read_csv("../data/ufo_dat.csv")
## Parsed with column specification:
## cols(
## datetime = col_character(),
## city = col_character(),
## state = col_character(),
## country = col_character(),
## shape = col_character(),
## `duration (seconds)` = col_integer(),
## `duration (hours/min)` = col_character(),
## comments = col_character(),
## `date posted` = col_character(),
## latitude = col_double(),
## longitude = col_double()
## )
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 266 parsing failures.
## # A tibble: 5 x 5
##     row col                expected               actual file
##   <int> <chr>              <chr>                  <chr>  <chr>
## 1  1606 duration (seconds) no trailing characters .5     '../data/ufo_dat…
## 2  1653 duration (seconds) no trailing characters .5     '../data/ufo_dat…
## 3  1660 duration (seconds) no trailing characters .1     '../data/ufo_dat…
## 4  1683 duration (seconds) no trailing characters .5     '../data/ufo_dat…
## 5  2039 duration (seconds) no trailing characters .05    '../data/ufo_dat…
## See problems(...) for more details.
ufo1 <- read.csv("../data/ufo_dat.csv")
Many structures work, but you should definitely have one.
Consider a “recipe” script.
setwd()
read.csv("adrienne/cool_project1/version6/sdfljk.csv")
read_csv() vs read.csv()
system.time(ufo <- read_csv("../data/ufo_dat.csv"))
## user system elapsed
## 0.525 0.023 0.553
system.time(ufo1 <- read.csv("../data/ufo_dat.csv"))
## user system elapsed
## 2.722 0.038 2.768
read_csv() is “lazy and surly” (Wickham).
read.csv(): stringsAsFactors = TRUE, uses rownames
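The stringsAsFactors difference in action (a sketch on a temporary file; note that since R 4.0, read.csv() defaults to stringsAsFactors = FALSE, so the old default is set explicitly here):

```r
# Write a tiny CSV with a text column
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(city = c("seattle", "portland")), tmp, row.names = FALSE)

base_df <- read.csv(tmp, stringsAsFactors = TRUE)
class(base_df$city)   # "factor": base R coerces strings

tidy_df <- readr::read_csv(tmp)
class(tidy_df$city)   # "character": read_csv is "lazy" and leaves strings alone
```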
%>%
head()
mean()
ufo <- ufo %>%
rename("duration_sec" = `duration (seconds)`)
This is the same as:
ufo <- rename(ufo, "duration_sec" = `duration (seconds)`)
mutate()
ufo <- ufo %>%
mutate(duration_min = duration_sec/60) %>%
mutate(duration_hr = duration_min/60)
kable(head(ufo[, c("duration_sec", "duration_min", "duration_hr")]))
| duration_sec | duration_min | duration_hr |
|---|---|---|
| 2700 | 45.0000000 | 0.7500000 |
| 7200 | 120.0000000 | 2.0000000 |
| 20 | 0.3333333 | 0.0055556 |
| 20 | 0.3333333 | 0.0055556 |
| 900 | 15.0000000 | 0.2500000 |
| 300 | 5.0000000 | 0.0833333 |
ufo <- ufo %>%
mutate(duration_min = duration_sec/60) %>%
mutate(duration_hr = duration_min/60)
In tidyverse functions, column names typically don’t get quotes.
?mutate
mutate()
Challenge: Add a column that gives the duration in years. (Hint: there are 8760 hours in a year.)
ufo <- ufo %>%
mutate(duration_years = duration_hr/8760)
filter(), select()
ufo %>%
filter(duration_hr > 10) %>%
select(datetime, duration_sec, duration_hr, `duration (hours/min)`) %>%
arrange(desc(duration_hr)) %>%
head()
## # A tibble: 6 x 4
## datetime duration_sec duration_hr `duration (hours/min)`
## <chr> <int> <dbl> <chr>
## 1 10/1/1983 17:00 97836000 27177. 31 years
## 2 6/3/2010 23:30 82800000 23000. 23000hrs
## 3 9/15/1991 18:00 66276000 18410. 21 years
## 4 4/2/1983 24:00 52623200 14618. 2 months
## 5 8/10/2012 21:00 52623200 14618. 2 months
## 6 8/24/2002 01:00 52623200 14618. 2 months
31 years?!
comment <- ufo %>%
filter(duration_sec > 90000000) %>%
select(comments)
print(comment$comments)
## [1] "Firstly, I was stunned and stared at the object for what seemed minutes, but probably was only seconds. My first inclination was to bec"
Challenge: Get only the observations shorter than a duration of your choice. Save the results to a new data frame called short_sight.
short_sight <- ufo %>% filter(duration_hr < 1)
group_by(), summarise()
For example, how many sightings were there per state?
n_per_state <- ufo %>%
group_by(state) %>%
count(sort = T)
kable(head(n_per_state))
| state | n |
|---|---|
| ca | 9655 |
| NA | 5797 |
| wa | 4268 |
| fl | 4200 |
| tx | 3677 |
| ny | 3219 |
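count() is shorthand for counting rows per group; summarise() computes arbitrary statistics. A sketch using mtcars, since the UFO data isn't bundled here:

```r
library(dplyr)

mtcars %>%
  group_by(cyl) %>%
  summarise(n = n(),              # rows per group, like count()
            mean_mpg = mean(mpg), # any summary statistic works
            max_hp = max(hp))
```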
for loops, apply(), purrr::map()
When to iterate:
Anytime you need to do something more than twice (ish)
for loops
Advantages: many other programming languages also have for loops.
Disadvantages: can be bulky and/or slow compared to other options.
for loops
How many UFOs were observed in each state?
states <- unique(ufo$state)
answer_df <- data.frame(state = states, n = NA)
for(i in 1:length(states)){
temporary_data <- ufo %>% filter(state == states[i])
answer_df$n[i] <- nrow(temporary_data)
}
kable(head(answer_df))
| state | n |
|---|---|
| tx | 3677 |
| NA | 0 |
| hi | 353 |
| tn | 1193 |
| ct | 968 |
| al | 691 |
apply functions
How many UFOs were observed in each state?
states <- unique(ufo$state)
count_obs <- function(state_id){
nrow(ufo %>% filter(state == state_id))
}
counts <- lapply(states, count_obs)
map functions
This is the tidyverse way…
How many UFOs were observed in each state?
states <- unique(ufo$state)
count_obs <- function(state_id){
nrow(ufo %>% filter(state == state_id))
}
ans_list <- map(states, count_obs)
ans_vect <- map_int(states, count_obs)
map
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
models <- mtcars %>%
split(.$cyl) %>%
map(function(df) lm(mpg ~ wt, data = df))
** Example stolen directly from R for Data Science: http://r4ds.had.co.nz/iteration.html#the-map-functions
summary(models[[1]])
##
## Call:
## lm(formula = mpg ~ wt, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1513 -1.9795 -0.6272 1.9299 5.2523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.571 4.347 9.104 7.77e-06 ***
## wt -5.647 1.850 -3.052 0.0137 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.332 on 9 degrees of freedom
## Multiple R-squared: 0.5086, Adjusted R-squared: 0.454
## F-statistic: 9.316 on 1 and 9 DF, p-value: 0.01374
names(models[[1]])
## [1] "coefficients" "residuals"    "effects"      "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
map(models, "coefficients")
## $`4`
## (Intercept) wt
## 39.571196 -5.647025
##
## $`6`
## (Intercept) wt
## 28.408845 -2.780106
##
## $`8`
## (Intercept) wt
## 23.868029 -2.192438
** Caveat: May not be a good idea statistically…
join()
Is the number of UFO sightings per year associated with economic conditions?
join()
First, get economic data: ?economics
kable(head(economics))
| date | pce | pop | psavert | uempmed | unemploy |
|---|---|---|---|---|---|
| 1967-07-01 | 507.4 | 198712 | 12.5 | 4.5 | 2944 |
| 1967-08-01 | 510.5 | 198911 | 12.5 | 4.7 | 2945 |
| 1967-09-01 | 516.3 | 199113 | 11.7 | 4.6 | 2958 |
| 1967-10-01 | 512.9 | 199311 | 12.5 | 4.9 | 3143 |
| 1967-11-01 | 518.1 | 199498 | 12.5 | 4.7 | 3066 |
| 1967-12-01 | 525.8 | 199657 | 12.1 | 4.8 | 3018 |
join()
Make a year column in the UFO dataset. This requires a brief detour into lubridate.
library(lubridate)
ufo <- ufo %>%
separate(datetime, into = c("date", "time"), sep = " ") %>%
mutate(date = mdy(date)) %>%
mutate(year = year(date))
join()
Let’s see what we got out of that:
ufo %>%
select(date, time, year) %>%
head() %>%
kable()
| date | time | year |
|---|---|---|
| 1949-10-10 | 20:30 | 1949 |
| 1949-10-10 | 21:00 | 1949 |
| 1955-10-10 | 17:00 | 1955 |
| 1956-10-10 | 21:00 | 1956 |
| 1960-10-10 | 20:00 | 1960 |
| 1961-10-10 | 19:00 | 1961 |
join()
We’ll need to know how many UFO sightings there were each year in our dataset.
Challenge: how would you do this?
for loops
apply functions
map functions
group_by() %>% summarise()
join()
ufo_counts <- ufo %>%
group_by(year) %>%
count()
kable(head(ufo_counts))
| year | n |
|---|---|
| 1906 | 1 |
| 1910 | 2 |
| 1916 | 1 |
| 1920 | 1 |
| 1925 | 1 |
| 1929 | 1 |
join()
We’ll also want economic summaries by year.
econ_summary <- economics %>%
mutate(year = year(date)) %>%
group_by(year) %>%
summarise_all(mean) %>%
ungroup() %>%
select(-date)
join()
df_new <- left_join(ufo_counts, econ_summary, by = "year") %>%
filter(year >= min(econ_summary$year))
kable(head(df_new))
| year | n | pce | pop | psavert | uempmed | unemploy |
|---|---|---|---|---|---|---|
| 1967 | 188 | 515.1667 | 199200.3 | 12.30000 | 4.700000 | 3012.333 |
| 1968 | 220 | 557.4583 | 200663.8 | 11.21667 | 4.500000 | 2797.417 |
| 1969 | 155 | 604.4833 | 202648.7 | 10.74167 | 4.441667 | 2830.167 |
| 1970 | 147 | 647.6917 | 204982.3 | 12.60833 | 4.983333 | 4127.333 |
| 1971 | 130 | 701.0000 | 207589.3 | 13.25833 | 6.275000 | 5021.667 |
| 1972 | 158 | 769.4333 | 209837.6 | 12.11667 | 6.108333 | 4875.833 |
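What left_join() does with keys that have no match (a toy sketch, not the UFO data): every row of the left table is kept, and missing matches become NA, which is why the join result above is filtered to years the economics data covers.

```r
library(dplyr)

sightings <- data.frame(year = c(1960, 1967, 1970), n = c(5, 188, 147))
econ      <- data.frame(year = c(1967, 1970), pop = c(199200, 204982))

# 1960 has no match in econ, so its pop comes back as NA
left_join(sightings, econ, by = "year")
```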
melt
What if we wanted to plot all these economic variables at the same time?
melt
library(reshape2)
df_long <- df_new %>%
melt(id.vars = c("year", "n"))
kable(head(df_long))
| year | n | variable | value |
|---|---|---|---|
| 1967 | 188 | pce | 515.1667 |
| 1968 | 220 | pce | 557.4583 |
| 1969 | 155 | pce | 604.4833 |
| 1970 | 147 | pce | 647.6917 |
| 1971 | 130 | pce | 701.0000 |
| 1972 | 158 | pce | 769.4333 |
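If you prefer to stay within the tidyverse rather than loading reshape2, tidyr can do the same reshape (pivot_longer() is the current verb; gather() is the older equivalent). A sketch on toy data shaped like df_new:

```r
library(tidyr)

df_toy <- data.frame(year = c(1967, 1968), n = c(188, 220),
                     pce = c(515.2, 557.5), pop = c(199200, 200664))

# Keep year and n as id columns; stack everything else into variable/value
pivot_longer(df_toy, cols = -c(year, n),
             names_to = "variable", values_to = "value")
```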
melt
ggplot(df_long, aes(x = value, y = n)) +
geom_point() +
geom_smooth() +
facet_wrap(~variable, scales = "free_x")
stringr functions
Which UFO observations refer to a color?
colors <- c(" red ", " orange ", " yellow ", " green ", " blue ", " purple ", " pink ", " white ", " black ")
ufo_colors <- ufo %>%
mutate(color_ans = str_detect(comments, colors)) %>%
filter(color_ans == T)
## Warning in stri_detect_regex(string, pattern, opts_regex = opts(pattern)):
## longer object length is not a multiple of shorter object length
Why the spaces?
ufo_colors$comments[1:10]
## [1] "1 object with green and red lights"
## [2] "Object silently traveled north -northwest. It was V shaped with five orange lights on the perimeter and one white ligh in the center of"
## [3] "Observed 2 white clouds of identical shape in clear blue shy with object becoming visible beneath one cloud."
## [4] "Silent, oval, bright white craft in yard."
## [5] "12 ovel objects flying east to west orange and red at a slow speed then vanished"
## [6] "Orange light flies overhead and turns black as it passed silently."
## [7] "Spherical. Red, yellow, and green lights. Below clouds, yet very high up."
## [8] "Bright white light and then 2 blinking red lights in Western Sky"
## [9] "Big extremely bright white light hovering in sky"
## [10] "Red circular object surrounded by yellow glow flying w/no sound quickly coming out of east turning southbound"
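The spaces keep " red " from matching words that merely contain "red" (like "bored"), but they also miss "red" at the start or end of a comment. A word-boundary regex is a more robust alternative (a sketch on made-up comments):

```r
library(stringr)

comments <- c("a red light", "bored teens", "red object overhead")

str_detect(comments, " red ")      # TRUE FALSE FALSE -- misses "red" at the start
str_detect(comments, "\\bred\\b")  # TRUE FALSE TRUE  -- \b matches word boundaries
```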
colors_ans <- map(colors, function(color){str_extract(ufo_colors$comments, color)})
colors_df <- as.data.frame(colors_ans)
colors_df <- colors_df %>%
unite("colors_included", 1:ncol(colors_df), sep = ";") %>%
mutate(colors_included = str_replace_all(colors_included, "NA;", "")) %>%
mutate(colors_included = str_replace_all(colors_included, ";NA", ""))
head(colors_df)
## colors_included
## 1 red ; green
## 2 orange ; white
## 3 blue ; white
## 4 white
## 5 red ; orange
## 6 black
ufo_colors <- ufo_colors %>% mutate(colors = colors_df$colors_included)
“You never get better at regex; you just get better at googling.”
What colors are most common in UFOs, and is there a relationship between color and shape?
library(tidytext)
colors_df <- ufo_colors %>%
unnest_tokens(output = colors, input = colors,
token = stringr::str_split, pattern = ";") %>%
mutate(colors = str_replace_all(colors, " ", ""))
colors_df %>%
select(date, colors) %>%
head() %>%
kable()
| date | colors |
|---|---|
| 1993-10-10 | red |
| 1993-10-10 | green |
| 1999-10-10 | orange |
| 1999-10-10 | white |
| 2007-10-10 | blue |
| 2007-10-10 | white |
colors_df %>%
group_by(colors) %>%
count(sort = T) %>%
ggplot(aes(x = colors, y = n)) +
geom_col() +
coord_flip()but red and orange show up a lot too!
First, we should just get the 10 most common shapes.
Challenge: How would you do that?
shapes <- colors_df %>%
filter(!is.na(shape)) %>%
group_by(shape) %>%
count(sort = TRUE) %>%
ungroup() %>%
slice(1:10)
kable(shapes)| shape | n |
|---|---|
| light | 621 |
| triangle | 319 |
| circle | 317 |
| fireball | 248 |
| sphere | 206 |
| unknown | 200 |
| other | 164 |
| oval | 143 |
| disk | 116 |
| formation | 72 |
What’s wrong with this?
Challenge: How would you do that?
shape_colors <- colors_df %>%
filter(shape %in% shapes$shape)
ggplot(shape_colors, aes(x = shape, fill = colors)) +
geom_bar(position = "fill")
What does each of these functions do?
mutate()
filter()
select()
group_by() %>% summarise()
str_detect() or str_extract()
join()
What are some approaches to iteration?
for loops
apply() functions
map() functions
in simple cases, may be replaceable with group_by() %>% summarise()
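To see that last point, compare the loop-based per-group count from earlier with the grouped version (sketched on mtcars):

```r
library(dplyr)

# Loop version: count rows for each unique value of cyl
counts <- integer(0)
for (cyl_val in unique(mtcars$cyl)) {
  counts <- c(counts, sum(mtcars$cyl == cyl_val))
}

# Grouped version: same numbers, one readable line
mtcars %>% count(cyl)
```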